Abstract
Early prediction of student academic performance is essential for identifying at-risk students and enabling timely intervention strategies. Traditional single machine learning models often struggle to capture complex relationships within educational data. This study proposes a hybrid stacking ensemble model that integrates Random Forest, Support Vector Machine (SVM), and XGBoost as base learners with Logistic Regression as a meta-learner. The model was evaluated on a publicly available student performance dataset. Experimental results show that the proposed hybrid model achieved an accuracy of 91.13%, an F1-score of 0.877, and an AUC of 0.965, outperforming the individual classifiers. The findings indicate that ensemble learning substantially improves predictive performance and reliability in academic early warning systems.
Introduction
Educational institutions increasingly use data-driven methods to improve student learning outcomes and to identify academically at-risk students early. Early prediction allows institutions to provide timely support, reducing both dropout rates and academic failure. Although machine learning models such as Random Forest, Support Vector Machine (SVM), and XGBoost have been widely used for predicting academic performance, individual models can suffer from high bias or high variance and may generalize poorly.
To address these issues, this study proposes a hybrid stacking ensemble model that combines multiple classifiers to improve prediction accuracy and robustness. The model uses Random Forest, SVM, and XGBoost as base learners, while Logistic Regression acts as the meta-learner that produces the final prediction. The dataset includes student demographic, academic, and social attributes, and students with a final grade below 10 are labeled "At Risk." Data preprocessing included label encoding, feature scaling, and SMOTE to handle class imbalance.
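The stacking architecture described above can be sketched with scikit-learn's `StackingClassifier`. This is a minimal illustration, not the authors' code: the synthetic data stands in for the student dataset, `GradientBoostingClassifier` stands in for XGBoost so the example needs only scikit-learn, and SMOTE (from the separate imbalanced-learn package) would in practice be applied to the training split only, before fitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the student dataset: y = 1 marks "At Risk"
# (final grade below 10 in the original data); classes are imbalanced.
X, y = make_classification(n_samples=600, n_features=15,
                           weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Base learners; the SVM gets feature scaling in its own pipeline.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
    ("svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))),
    ("gb", GradientBoostingClassifier(random_state=42)),  # stand-in for XGBoost
]

# Logistic Regression meta-learner stacked on cross-validated base predictions.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
stack.fit(X_train, y_train)
print(f"hold-out accuracy: {stack.score(X_test, y_test):.3f}")
```

With `cv=5`, the meta-learner is trained on out-of-fold base-model predictions, which limits the leakage that would occur if the base models scored the same data they were fit on.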
The model's performance was evaluated using accuracy, precision, recall, and F1-score, together with the confusion matrix and the ROC curve with its AUC. Experimental results show that the hybrid stacking model achieved the best performance, with 91.13% accuracy, outperforming the individual Random Forest, SVM, and XGBoost models. The confusion matrix revealed that the model correctly identified most at-risk students with very few misclassifications, and the ROC-AUC of 0.965 indicates excellent discrimination between the two classes.
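The evaluation metrics named above can be computed with scikit-learn as follows. This is a small self-contained sketch on synthetic data with a plain Logistic Regression classifier; the reported figures (91.13% accuracy, 0.965 AUC) come from the paper's own dataset and are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary task; y = 1 plays the role of "At Risk".
X, y = make_classification(n_samples=400, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)[:, 1]  # scores for the ROC curve / AUC

print("accuracy :", accuracy_score(y_te, y_pred))
print("precision:", precision_score(y_te, y_pred))
print("recall   :", recall_score(y_te, y_pred))
print("F1-score :", f1_score(y_te, y_pred))
print("AUC      :", roc_auc_score(y_te, y_prob))
print("confusion matrix:\n", confusion_matrix(y_te, y_pred))
```

Note that AUC is computed from the predicted probabilities rather than the hard labels, since the ROC curve sweeps over decision thresholds; the threshold-dependent metrics (precision, recall, F1) use the default 0.5 cutoff.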
Conclusion
This study proposed a hybrid stacking ensemble model for predicting student academic performance. The integration of Random Forest, SVM, and XGBoost through a Logistic Regression meta-learner resulted in improved classification performance.
The model achieved 91.13% accuracy and an AUC value of 0.965, outperforming standalone classifiers. The proposed approach can serve as an effective academic early warning system for identifying at-risk students.
Future research may focus on:
• Evaluating the model on larger multi-institutional datasets
• Incorporating deep learning architectures
• Developing real-time deployment systems for institutional use